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Abstract 

When studying a metastable dynamical system, a prime concern is how to de¬ 
compose the phase space into a set of metastable states. Unfortunately, the 
metastable state decomposition based on simulation or experimental data is 
still a challenge. The most popular and simplest approach is geometric clus¬ 
tering which is developed based on the classical clustering technique. However, 
the prerequisites of this approach are: (1) data are obtained from simulations 
or experiments which are in global equilibrium and (2) the coordinate system is 
appropriately selected. Recently, the kinetic clustering approach based on phase 
space discretization and transition probability estimation has drawn much atten¬ 
tion due to its applicability to more general cases, but the choice of discretization 
policy is a difficult task. In this paper, a new decomposition method designated 
as maximum margin metastable clustering is proposed, which converts the prob¬ 
lem of metastable state decomposition to a semi-supervised learning problem 
so that the large margin technique can be utilized to search for the optimal 
decomposition without phase space discretization. Moreover, several simulation 
examples are given to illustrate the effectiveness of the proposed method. 

Keywords: metastable states, large margin methods, clustering analysis, 
semi-supervised learning 


1. Introduction 

Metastability is an ubiquitous phenomenon which occurs in many complex 
systems in nature, including conformational transitions in macromolecules [5], 
autocatalytic chemical reactions |3] and climate changes |3]. Generally speaking, 
the metastability of a dynamical system means that the phase space of the 
system consists of multiple macrostates called metastable states, in which the 
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system tends to persist for a long time before transiting rapidly to another 
metastable state. In this paper, we focus on the problem of metastable state 
decomposition, i.e., how to partition the phase space of a given system into a set 
of ideal metastable states so that transitions between different metastable states 
are rare events, meanwhile, there is a considerable gap between the timescales of 
movements within and between metastable states. This is an important problem 
for analysis of metastable systems because: 

1. Metastable state itself is often physically interesting. For instance, for 
biomolecules, the metastable states are related to the basins of the free 
energy surface, and the metastable state decomposition is helpful for seek¬ 
ing the native states with lowest energy and kinetic trap states with locally 
minimal energy which can represent the large scale geometric structures 
of conformations (see e.g., [51|S]). 

2. The dynamical behavior of a metastable system can be approximately de¬ 
scribed as a transition network of metastable states on a slow timescale, 
and such approximate description is able to provide a simple and global 
picture of the system dynamics which preserves the essential dynamical 
properties. Fig. gives an example of metastable system and the its dy¬ 
namical approximation. Note that the metastable state based dynamical 
approximation is closely related with the Markov state models (MSMs) 
|7], where transitions between discrete states obtained from phase space 
discretization are assumed to be Markovian and the corresponding transi¬ 
tion probability matrices can be used to characterize the system dynamics. 
A large number of theoretical and experimental studies (see e.g., [3 [7]- 
m) demonstrated that a small-sized and satisfying MSM can be obtained 
by treating each metastable state as a Markovian state if the transition 
rates between different metastable states are sufficiently small compared 
to lifetimes of metastable states. Although the research focus of MSMs 
has shifted from metastability to spectral approximation and maximizing 
metastability has been shown to be not the optimal way to improve the 
accuracy of MSMs in recent years m, the metastable state decomposi¬ 
tion is still commonly used to construct MSMs in applications due to its 
ease-of-use. Moreover, it is required for some more advanced analysis and 
modeling techniques of metastable systems. For example, it was proved 
in 13 [in] that the quality of an MSM can be improved through fine dis¬ 
cretization around boundaries between metastable states, and the hidden 
Markov model analysis approach of metastable systems proposed in [T^ 
needs the metastable state decomposition for parameter initialization. 

Moreover, the metastable state decomposition is also applicable to unsupervised 
classification problems when data are sequentially sampled and the data cate¬ 
gories are rarely changed during sampling (e.g., classification of sensor data). 

For some low-dimensional systems, metastable states can be found manually 
by exploring their energy landscapes (see e.g., |13IfT5] b But for high-dimensional 
and complex systems, the intuitive partitioning of the whole phase space is gen¬ 
erally infeasible, and the metastable state decomposition can only be performed 
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(a) Potential energy landscape (b) Transition network of metastable state 

Figure 1: Illustration of dynamical approximation based on metastability de¬ 
composition. The left panel (a) shows the energy landscape of a stochastic 
diffusion system, which has two potential wells and can be decomposed into two 
metastable states Ai and A 2 according to the energy barrier. The right panel 
(b) shows the corresponding transition network obtained from the metastable 
decomposition. 


through statistical analysis of the simulation or experimental data. 

The simplest approach for data based metastable state decomposition is ge¬ 
ometric clustering, which detects the metastable states through grouping data 
points in phase space which are geometrically similar, and are usually imple¬ 
mented by classical clustering algorithms UMI] including fc-means, fc-medoids, 
Bayesian clustering and self-organizing maps. The theory basis of this approach 
is that metastable states generally correspond to “energy basins” in the phase 
space and the local maxima of the density function of the equilibrium state dis¬ 
tribution are then centers of metastable states. The application of the geometric 
clustering has two key limitations. First, it requires that the global equilibrium 
is reached in simulations or experiments so that the empirical distribution of the 
data set is consistent with the equilibrium distribution. Second, it is sensitive 
to the selection of the coordinates or projected coordinates of the phase space 
because a “bad” coordinate system may destroy the geometric structure of the 
energy landscape. (Note that for analysis of experimental data, the coordinate 
system is determined by the observation model of experiments and cannot be 
arbitrarily selected.) An alternative to the geometrical cluster approach is the 
density based clustering approach [5^[5!^ . which tries to measure the data point 
density directly and cut out regions of high data point density from data sets. 
It is, however, rather susceptible to statistical uncertainty and noise. 

A more general approach is kinetic clustering, which involves two steps: (1) 
discretize the phase space into small bins, and estimate the transition proba- 
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bilities between bins from simulation or experimental trajectories, (2) lump the 
discrete bins to metastable states through minimizing the transition probabil¬ 
ities between different metastable states. In contrast to the geometric cluster¬ 
ing approach, this approach implements clustering based on “kinetic similarity” 
rather than geometric similarity, and can effectively utilize the information of 
the system dynamics contained in data to improve the accuracy of the decompo¬ 
sition. Furthermore, since the kinetic clustering method is implemented based 
on transition probability distributions, it only requires that simulations or ex¬ 
periments are in local equilibrium instead of global equilibrium |24| . From the 
viewpoint of Markov chain theory, the kinetic clustering for metastable state 
decomposition is in fact a Markov chain compression problem [S]. 

The most popular representatives of this approach are Perron cluster cluster 
analysis (PCCA) and PCCA+ [3 [2S], which were developed based on the 

principle of spectral clustering by using transition probabilities to define similari¬ 
ties between bins. PCCA(+) methods have been widely used in applications due 
to their computational efficiency and the acceptable quality, but they are only 
applicable to time reversible systems. In m, a singular value decomposition 
based lumping algorithm for nonreversible systems was proposed. Furthermore, 
Jain and Stock [28] presented a “most probable path algorithm” for bin lump¬ 
ing in order to avoid the computation of large matrix decompositions, and a 
Bayesian lumping method was proposed in |2S] which considers the statistical 
uncertainty in transition probabilities. The main difficulty of kinetic clustering 
comes from the choice of bins, and the boundaries of metastable states are un¬ 
able to be accurately captured with a poor choice of bins. Generally speaking, 
the discrete bins used in kinetic clustering are still given by some geometric clus¬ 
tering algorithm where the system dynamics is not considered, and one can only 
improve the decomposition accuracy by adding more bins. But a large number 
of bins may cause overfitting problem in the transition probability estimation. 
Some recent publications isniisi] investigated the choice of bin number from the 
perspective of model comparison, but how to adjust the shape of bins to meet 
the requirement of metastable state decomposition is still an open problem. In 
|32| . an adaptive decomposition method was proposed, which iteratively imple¬ 
ments the space discretization and lumping to modify boundaries of metastable 
states. However, the convergence of this method cannot be guaranteed. 

The objective of this paper is to propose a new clustering method for meta¬ 
stable state decomposition based on the large margin principle. Recently, large 
margin techniques have received increasing attention in the machine learning 
community |33[ 134] . For a given supervised or unsupervised classification prob¬ 
lem, large margin techniques can improve the robustness and generalization ca¬ 
pability of the classifier through maximizing the margin between training data 
and classification boundaries. In this paper, we propose an optimization model 
called maximum margin metastable clustering for metastable state decomposi¬ 
tion by combining the large margin criterion and metastability criterion, which 
can effectively utilize both the geometric and the dynamical information con¬ 
tained in data, and develop a two-stage algorithm for searching the optimal 
decomposition by combining global search and local search techniques. In com- 
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parison with the previous decomposition approaches, this decomposition method 
directly constructs the boundaries of metastable states in the continuous phase 
space without any pre-discretization, and can achieve a reliable and robust de¬ 
composition based on the the large margin criterion even for small-scale data 
sets. 

2. Preliminaries 

In this section, we review briefly some of the relevant background on large 
margin learning. Given a set of training data and their class la¬ 

bels {yn}n=ii where each x„ is a data point sampled from a domain X and 
yn G {1, • ■ •) the support vector machine (SVM) finds a hyperplane classifier 
defined by the decision rule 

y = argmax (x) -t bk) (1) 

l<fc<K 

which can map the training data to correct labels and achieve strong generaliza¬ 
tion performance through maximizing the classification margin between training 
data and decision boundaries. Here </> is a mapping function from X to a, feature 
space and is generally induced by a Mercer kernel function Ker (•, •) with 

Ker (x, x') = 0 (x)^ 0 (x') (2) 

The optimal decision rule of SVM can be obtained by solving the following 
optimization problem |33| : 

min i ||Wf 

s.t. Vn = 1,..., fV, k = l,...,K, (^) 

where W = (w]^,..., wJ)t and b = (6i,..., . The constraint of ([^, called 

large margin constraint, states that for each training data x„, the decision func¬ 
tion value of x„ for the true class exceeds the decision function value for 
any other competing class fc ^ by at least one, and ||W|| can be shown to 
be inversely proportional to the classification margin in the feature space under 
the large margin constraint. Note that (|^ is a quadratic programming and 
the global optimum can be efficiently found by convex optimization methods. 
However, in many practical cases, there is no feasible solution which satisfies 
the large margin constraint exactly due to the linear non-separability of training 
data, and we can only seek a large margin classifier which separates the data 
“as much as possible”. In order to achieve this, we can introduce a slack variable 
for each x„ and modify the optimization problem ^ as 

mm i/3||\vf+ 

s.t. Vu = l,...,iV, k=l,...,K, 

(wT^ - wl) cj) (x„) -k {by^ - bk) + > 1 - Cn 


( 4 ) 
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where ^ = (^i,... ,^Ar)^ and /3 > 0 is the regularization parameter. It can be 
observed from constraints of Q that > 0 for all n, and > 0 if and only 
if (x„) + — (w^^(x„) + 6fe) > 1 does not hold for some k ^ n. 

Therefore we can conclude that represents the misclassification loss of x„, 
and the objective function of @ balances the empirical risk on the training 
data versus the classification margin. 

Remark 1. It is worth pointing out that SVM was originally designed for two- 
class classification jSS]. A variety of strategies have been proposed to extend 
the SVM for multi-class problems, such as one-against-all [55], one-against-one 
m and error-correcting output coding |38| . Here we select the multi-class 
SVM proposed in m which minimizes the the multi-class margin loss directly, 
because it is more “logical” than the other strategies from the perspective of large 
margin learning and can be easily applied to unsupervised learning problems (see 
below). 

Maximum margin clustering (MMC) |391140) extends the maximum margin 
principle to unsupervised learning, i.e., data clustering. For a set of unlabeled 
data {x„}, MMC targets to construct a maximum margin decision rule by opti¬ 
mizing 0 with both (W, b) and data labels {?/„} being decision variables. But 
in this unsupervised case, the optimization problem Q has a trivially optimal 
solution with ?/„ = 1 and ||W|| = 0, which implies that all data are assigned to 
the same class and an infinite classification margin is obtained. To prevent such 
physically meaningless solutions with empty classes, the following class balance 
constraint is required for MMC: 

N 

giN < '^ ly^^k < QuN, Vfc=l,...,K ( 5 ) 

m—1 

where gi , denote the lower and upper bounds of the proportion of each class 
size, and satisfy 0 < gi < 1 /k < gu < I- Then, the complete MMC problem 
(with slack variables) can be formulated as 

min 
y,w,b,« 

s.t. 


with y = (j/i,..., un) & {!,..., k}^. In contrast with the SVM problem Q, 
([^ is a nonconvex mixed-integer problem and much more difficult to solve. The 
existing optimization methods for the MMC problem can be roughly catego¬ 
rized into two types. The first type of method relaxes the MMC problem into 
a semidefinite programming (SDP) problem that can be globally optimized by 
standard SDP solvers jiniiii]. However, this type of method is very computa¬ 
tionally expensive and can only be applied to small data sets. The other type of 
method utilizes some local search algorithms, such as concave-convex procedure 
|42] and alternating optimization |45| , to solve the MMC problem directly, which 


i/3|iWf+ 

Vn = 1,..., TV, k = 1,..., K, 

{^yr^ ~ ’^fc) ^ (^") + (^!/^ “ = k > 1 - Cn, 

giN < ELi < 9uN. 
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is more efficient than the relaxation method but usually suffers from undesired 
local minima. 

Remark 2. The balance constraint ([^ is also useful to avoid the problem of 
separating a single outlier (or very small group of outliers) from the rest of the 
data, and allows one to incorporate the prior knowledge on the balance of data 
distribution. The theoretical analysis and empirical research on the balance 
constraint is still very limited and the parameters gi , are usually selected by 
trial and error. According to our numerical experience, the parameters can be 
simply set as [quQu] = [e, 1 — e] with e = 10“^ ~ 10”^, and the value of e 
influences only a little on clustering results. Moreover, it is worth pointing out 
that the balance constraints [gi, = [0,1] and [gi, gu] = [e, 1 — e] with e being 
positive and sufficiently small are completely different in essence for MMC. The 
former one is always an “inactive” constraint and the optimal decision boundary 
is infinitely far away from all the data, and the latter one enforces that the 
optimal decision boundary of MMC must pass through the data set. 

Remark 3. For some commonly used Mercer kernel functions (e.g., Gaussian 
kernels), the dimensions of the corresponding feature spaces are extremely high 
or infinite, and the direct feature mappings are impossible. In such a case, 
the kernel trick can be used to solve the SVM or MMC problem by perform¬ 
ing the feature mapping implicitly based on the kernel matrix K = [ATy] = 
[Ker(xi,Xj)] € (see [331 HO] for details). While the kernel trick has been 

widely and successfully applied in large margin learning, the calculation of ker¬ 
nel matrices is a bottleneck of the kernel trick for large-scale data sets. In recent 
years, a lot of alternatives to the kernel trick have been proposed to reduce the 
computational and storage costs (see, e.g., |431l46] 1. which can approximate the 
induced feature mapping </> by a low dimensional function cj) (x) such that 

</> (x)"*^ c/) (x') = Ker (x, x') « ^ (x)'*’0 (x'), Vx,x'G A (7) 

Therefore, in this paper, we only consider the case that the feature mapping <p 
can be explicitly defined and computed. 


3. Maximum margin metastable clustering 

In this section, we apply the framework of large margin learning to metasta¬ 
bility analysis. Suppose that we have L trajectories {x},..., x)^^^}, {xf,..., x^^}, 
..., {xf,..., in a phase space X which are generated by simulations or 

experiments of a dynamical system, where the point x( represents the system 
state at time t in the Tth run, and Mi denotes the time length of the l-th run. 
Furthermore, for convenience of notation, we define 

r = {(x„,x„)}()Li = {(x',x'+i)|l <l<L,l<t<Mi} (8) 

as the set of all transition pairs that appear in trajectories. The purpose of 
metastable state decomposition is to partition X into k subspaces (metastable 
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states), so that it is a rare event - not only for the given L trajectories but also 
for new trajectories generated by the same system - that X( and xt+i belong to 
different metastable states. 

Remark 4. An important issue for metastable state decomposition is determin¬ 
ing the value of k. But this issue is beyond the scope of this paper and we 
simply assume that k is given. In practical applications, k can be obtained by 
analyzing of life times of metastable states (see for details). 

Analogous to learning in SVM and MMC, we can construct a large margin 
decision rule for metastable state decomposition based on the following criteria: 

1. Metastability criterion. For most of state transition pairs (x„, x„) G V, 
x„ and x„ should be classified to the same metastable state, which implies 
that the boundaries between metastable states are rarely crossed in runs 
of the system. 

2. Large margin criterion. The metastable state boundaries should be 
placed as far away from the trajectory data as possible in order to improve 
the generalization performance of the decomposition result for unknown 
trajectories. 

The two criteria leads to a maximum margin metastable clustering (M^C) 
method that minimizes ||W|| under the following large margin metastable con¬ 
straint: 


~ ^ (^") + - ^k) + ^y,^=k > 1 

- ^l) ^ + {by„ - bk)ly^=k > 1, Vn,fc (9) 


where G denotes the metastable state label of x„ and x„. It can 

be seen that M^C is in fact a semi-supervised learning method which can be 
interpreted as follows: 

1. The constraint enforces x„ and x„ being assigned to the same metastable 
state for each state transition pair (x„,x„). 

2. For any x G U^=i{^nj is assigned to the metastable state y, 

the the distances between x and the boundaries of y are 



for k ^ y (10) 


Then the margin between data and decision boundaries can be increased 
by minimizing the objective function ||W|| of M^C. 

Moreover, it is interesting to note that the M^C method can be expressed as a 
specific MMC method in the space of transition pairs (x„,x„), which seeks a 
decision rule 


{y, y) = argmax (x, x) -f 


( 11 ) 


under a large margin constraint 



y„=k^k > 1 ( 12 ) 
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where 


'kk 


= ( 




^kk — 


bj, + bk and ^(x„,x„) = (</>(x„)t, </>(x„)t)i 


denotes the extended feature function. (The proof of the equivalence between ([^ 
and (121 is given in Appendix [M|) The decision rule ( |TI| ) can map a transition 
pair in the phase space to a pair of metastable state labels, and the constraint 


(12 1 restricts that all transition pairs in V are not only far away from the decision 


boundaries in the product feature space 4>{X) x 4>{X) but also enclosed in the 
area U^^i{(x, x)|(A:, fc) = argmax^ ^ u;l^,</>(x, x) + 6 ^^}. 

Based on the above discussion, as well as by introducing slack variables and 
the class balance constraint as in MMC, we can formulate the M^C problem as 


mm 

y,w,b,« 

s.t. 


i/3||Wr + 

Vn = 1,..., A^, Vfc, k = 1,..., K, 

-^L) 

i^Vnyn ~ ^kk'j ^y^=k=k ^ 1 ~ ^711 

eiN < El=i K^=k < QuN. 


(13) 


4. Comparison of maximum margin metastable clustering with rela¬ 
tive methods 

Fig-i shows a two dimensional and two-class example to illustrate the dif¬ 
ference between the ideas of MMC and M^C (or the other geometric clustering 
methods). In this example, all the trajectory data are distributed in areas Ai, 
A 2 and A 3 , and Ai is far away from A 2 and A 3 . As shown in Fig. the 
optimal decision boundary founded by MMC is a horizontal line for which can 
achieve the maximum classification margin, and then MMC decomposes the 


three areas as {Ai} U {A 2 , A 3 }. Fig. 2b displays both data points and trajecto¬ 


ries. As can be seen, the MMC decomposition severely violates the large margin 
metastable constraint because of the existence of frequent switches between Ai 
and A 2 in trajectories, and the optimal decomposition provided by M^C (with 
an appropriate class balance constraint) is then {Ai, A 2 }U{A 3 }. Obviously, the 
decomposition result of M^C is more reasonable from the view of metastability 
analysis. From this example we can observe that the M^C method is able to 
exploit the information on both phase space distribution and metastable dy¬ 
namics contained in the trajectory data by using the large margin metastable 
constraint. 

Compared with the kinetic clustering methods of metastable state decom¬ 
position (e.g., PCCA+), the M^C method directly minimizes the “loss” caused 
by boundary crossings between metastable states without an explicit model of 
the system dynamics. Therefore, plenty of statistical problems which are essen¬ 
tial and difficult for kinetic clustering methods, including the choice of space 
discretization and the estimation of transition probabilities, are not present in 
M^C. Furthermore, the large margin criterion used in M^C provides a convenient 
and powerful way to reduce the influence of statistical noise. Although some 
kinetic clustering methods can also handle the statistical noise in a Bayesian 
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(a) MMC 
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(b) M^C 


Figure 2: Illustration of decision boundaries obtained by MMC and M^C, where 
solid lines are decision boundaries, dashed lines show the margins, dash-dot lines 
represent phase space trajectories, and circles denote data points sampled from 
the trajectories. 


manner (see e.g., [IS]), M^C offers the advantage of naturally achieving good 
performance of generalization and robustness without assuming any prior dis¬ 
tribution. 

The major challenge of M^C arises from (131, which is a nonconvex mixed- 
integer optimization problem as the MMC problem Due to the similarity 
between the MMC and M^C problems, solving of the latter encounters similar 
difficulties: the SDP relaxation based global search requires a high computa¬ 
tional cost and the local search easily gets stuck in poor local optima. 


5. Optimization method for metastable maximum margin clustering 


In this section, we will propose a coarse graining based strategy to solve the 
M^C problem (13) that takes advantages of both global search and local search 
methods, and is applicable to large-scale data sets. The proposed optimization 
procedure can be sketched in the following steps: 


1 . Coarse graining. Discretize the space X x X oi transition pairs into 

{k < N‘^ <C N) bins and approximate P by a coarse-grained data 
set with normalized weights {c„}, where (x((,x(j) 

denotes the center of the n-th bin and c„ is proportional to the number 
of transition pairs contained in the n-th bin. This step can be done by 
A:-means, fc-medoids or any other traditional clustering algorithm, and the 
value of is chosen based on the limitation of computational resources. 

2. Global search. Perform M^C on the coarse-grained data set by using 
the SDP relaxation method. Note that only has a small number of 
distinct elements, then a satisfactory solution can be achieved in this step 
without too much computing burden. 






















5.1 Global search 
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3. Local search. Refine the solution obtained in the global search step by 
applying some local search algorithm to the M^C problem on V. Here we 
select the alternating optimization presented in |43] to perform the local 
search because it is computationally efficient and can handle the class 
balance constraint easily. 

Below we will discuss the last two steps in more detail. 


5.1. Global search 

For convenience of analysis and computation, we confine that all decision 
hyperplanes pass through the origin in the feature space, i.e., bi = ... — b,.^ = 0. 
(This is a mild restriction especially for high dimensional feature spaces, since 
it reduces the number of freedom degrees of decision hyperplanes only by one.) 
Then the M^C problem on can be expressed as 

min i/? ||W||^ + 

s.t. Vu = 1,..., Vfc, fc = 1, • ■ •, K, 

Ql ^ Sm=l = k ^ Qu- 


where y'^ G and denotes the label and slack variable of the u-th 

coarse-grained transition pair (x^, x^), = (yj,..., ,..., 

and c = (ci,..., CArc)T denotes the vector of weights of coarse-grained transition 
pairs. It is clear that the assignment y'^ in ( |l4| ) can be represented by the 
equivalence relation matrices D S and M € which are defined 

by 

D = [D^j] = [ly^=j], M = [M,,] = (15) 


i.e., the (i,j)-th element of D indicates if the label of (x(^,x°) is j, and the 
element of M indicates if (x°,x(^) and (x°,xp are assigned to the same 

label. 

Replacing by (M, D) as assignment variables in (141 and using the strong 
duality theorem, we can get an equivalent form of the coarse-grained M^C prob¬ 
lem. 


Theorem 5. Solving (141 is equivalent to solving the optimization problem 

lew - cTa - ^tr (MCK®C) -f ITc 


mm 

M,D,a,6» 

S.t. 


q (D) -I- (1 (g) I) a + R0 < 0, 

Ql < Me < Qu, 
diag (M) = 1, 

M = DDR 

M e {0,1}^"''^" ,D e {0,1}^"''”. 


(16) 


where C is a diagonal matrix with the elements of c on the diagonal, q (D) is a 
linear function o/D defined in (B.IO), and definitions o/K® and R are given 
by (B.6 l and (B.12I. 
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Proof. See [Appendix B| 


□ 


Remark 6. The last four constraints on M and D of (16) come from the fact 
that there is a y'^ such that ( |15[ ) holds if and only if M and D are both binary 
matrices and satisfy diag (M) = 1 and M = DD’’’ |34| . 


It is easy to see that (161 is a convex optimization problem if we drop the 


constraints that M and D are binary matrices and the nonlinear constraint 
M = DD’’’. This motivates an approximation method for solving the coarse¬ 
grained M^C problem based on two relaxations as in [31]. The first relaxation 
allows elements of M and D to take values in [0,1] instead of {0,1} and the 
second relaxation is to replace M = DD'’' with a convex inequality M ^ DD'’'. 


By using these two relaxations, (161 can be relaxed to a convex optimization 
problem: 


(17) 


min i6>T0-cTa-,Ttr(MCK®C)-|-lTc 

s.t. q (D) -I- (1 (g) I) a + R0 < 0, 
gi < Me < Qu, 
diag (M) = 1, 

M ^ DDT, 

0<M<1,0<D<1. 

According to the Schur complement lemma |47| . we can further reformulate (jT 
in a more convenient form: 


mm 

M,D,a,0,C 

s.t. 


c 

I e 

6>T 2(c + cTQ:-t-^tr(MCK*C)-lTc 

q (D) + (1 (g) I) q: -f R0 < 0, 

Ql < Me < Qu, 
diag (M) = 1, 

I DT 


0 , 


(18) 


D M 


^ 0 , 


0<M<1,0<D<1. 


It is an SDP problem and can be solved in polynomial time. 


Since the optimal M obtained from (181 is a symmetric matrix satisfying 
diag (M) = 1 and 0 < M < 1, it can be interpreted as a similarity matrix of 
with each element Mij representing some measure of the similarity between 
(x(^,x(^) and (xj,Xj). Hence, we can utilize the spectral clustering method to 
recover y'^ from the optimal M such that = lj/==y‘: approximately holds for 
all i,j. 

Remark 7. Obviously, the relaxation method proposed in this section can be 


directly applied to the original M^C problem (13) without any coarse graining. 


and the corresponding relaxed problem can be solved even in the case that we 
do not know the feature mapping implicitly but only know the kernel function 
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Ker (•,•). However, this scheme is generally computationally infeasible because 
the it involves an SDP problem with 0{N'^) parameters. 


5.2. Local search 

We now investigate how to refine the clustering result obtained in the global 
search by local search. A natural way to do this is to alternatively minimize 
(131 with respect to (W,b) keeping y fixed and vice versa until convergence, 
where the initial value of y is given by the coarse-grained label assignment y'^ 
with 

Un = Di, if (xn,x„) is in the bin centered at (x^,x°) (19) 


Note that the M^C problem (131 with a fixed y is just a standard quadratic 
programming problem. Here we only discuss the optimization problem with 
respect to y. For fixed W and b, (131 is reduced to 

Vn = 1,..., A', Vfc, k = 1,..., /v, 

^(x„,x„) (20) 


mm 

y 

s.t. 


^ 

(^VnVn ~ A ^ 1 ~ 

< Em=l < QuN. 


It is simple to verify that (20) can be transformed into a binary linear program¬ 
ming problem 


mm 

D/ 

S.t. 


tr (HTDf) 

qiN < ITD-f < QuN, 
D-f G {O,!}^""”. 


( 21 ) 


where D-f = [d(.] is a relation matrix with £>/ = 1^^=^, and H = [Hij] 


IS 


defined by 

Hji = max 1 — 1 


k.k 


j—k—k 


(wF - ck (x„ Xi) - (bjj - (22) 


Although (211 belongs to the class of NP-hard problems, it can be efficiently 
tackled by enumeration and cutting-plane techniques |48| in practice, and there 
exist a lot of software packages for solving large-scale integer programming prob¬ 
lems like (211, including MOSEK [49], Gurobi [50] and GLPK |51| . 


Remark 8. It was reported in m that the alternating optimization method for 
MMG often suffers from premature convergence and can only change a small 
proportion of data labels even with a poor initialization, but our numerical ex¬ 
periments show that this problem is not serious for the local search procedure 
of M^G. The analysis of this phenomenon requires further investigation and we 
only give a rough explanation here. Generally speaking, the switching between 
metastable states can be observed many times in trajectories. Therefore, dur¬ 
ing the local search of M^G, there are always a number of transition pairs that 
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Algorithm 1 Optimization procedure for M^C 


1 : generate a coarse-grained transition pair set with cardinality and 
normalized weights c = (ci,..., CArc)T from V by the A:-means or fc-medoids 
algorithm 

2 : find (M* ^ 

to get 


D*) as the solution to the SDP problem (18) 

3: perform the spectral clustering algorithm with similarity matrix M' 
class labels y'^ = (yf,..., y%c) of coarse-grained transition pairs in 
4: calculate class labels y = (yi,... ,yN) of transition pairs in V from by 

5: initialize with y^'^^ = y and r = 1 

6 : repeat 

7: find (W*, b*) as the solution to the quadratic programming problem ( 13 1 

with y fixed to be and set = (W*,b*) 

8 : find = [D{*] as the solution to the binary linear programming prob¬ 
lem (21) with (W,b) set to be b^’’^). 

9: calculate y* = (y*,..., y%) by y^ = argmax^- D^* and set y(") = y* 

10: set r := r -I- 1 

11 : until the Hamming distance between y^’’^ and is less than or equal 

to a given constant an or r is larger than a pre-defined threshold rmax 
12 : return y^’’^) 


are close to the decision boundaries and violate the large margin metastable 
constraint with not-so-small slack values even if /3 <C 1. These misclassified 
transition pairs are helpful to “push” the decision boundaries away from the ini¬ 
tial positions and their labels can be easily changed especially in early iterations 
of the local search. 

5.3. Full description of the optimization procedure 

Based on the above analysis, the complete optimization method developed 
for M^C can be summarized in Algorithm [l] 


6. Experiments 

In this section, we demonstrate the performance of the proposed decompo¬ 
sition method on some synthetic metastable systems and molecular dynamics 
simulations. 

The detailed settings of Algorithm [l] for M^C are as follows: 

• In the coarse graining step, is provided by the standard fc-medoids 
algorithm with = 30. (Here we use fc-medoids rather than fc-means 
because A:-medoids is more robust to outliers and can avoid the appearance 
of the coarse-grained transitions which make no physical sense.) 
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All the involved SDP problems and the binary linear programming prob¬ 
lem (211 are solved by using the Mosek solver jlHI through the CVX in¬ 


terface in Matlab 


• The spectral clustering algorithm proposed in 
from the relation matrix M. 


is utilized to extract y'^ 


The feature mapping (p (■) is induced from the Gaussian kernel 


Ker (x, x') = exp — 


|x — X 

2(t2 


(23) 


and explicitly computed by the random Fourier method |44| with d = 50. 
The width parameter a is determined by a grid search over {2“^, 2“^,..., 2^} 
and we select the value of a which leads to the minimal value of the ob¬ 


jective function of the M^C problem (13). 


• The value of regularization parameter /3 does not have a significant effect 
for our experiments, so it is simply fixed to 0.01. 

• The parameters of termination condition are set to be an = 0 and r^ax = 
100, and the parameters of class balance constraint are {gi, Qu) = (0.01,0.99). 


For comparison purposes, the following three decomposition methods are also 
considered in our experiments: 

1 . /c-medoids clustering, which directly decomposes the phase space into k 
macrostates based on the set S = {x(|l < I < L,1 < t < Mi}. 

2. MMC based on S, where the MMC problem is solved by a mixed algorithm 
similar to Algorithm (see [Appendix C I. 

3. PCCA+ |26| . where the implementation details are given in Appendix 
[Appendix^ 

Remark 9. fc-medoids clustering and MMC can be viewed as geometric cluster¬ 
ing methods in metastability analysis. 

Remark 10. Considering the inherent randomness of the fc-medoids algorithm, 
here we perform the fc-medoids clustering (or fc-medoids clustering in coarse 
graining steps of MMC, PCCA+ and M^C ) in the following way: Repeat the 
fc-medoids algorithm 100 times independently with random initialization and 
pick the solution with the minimal “within-cluster point scatter”. 


6.1. Sequential unlabeled data 

Here we apply M^C and other metastable state decomposition methods to 
sequential data which are generated by using data sets letter, satellite, spam- 
base, waveform and segment from the UCI machine learning repository |55| . 
All data sets consist of multiple classes of instances and the pattern of each 
instance is represented by a multidimensional vector of real- or integer-valued 
features. A summary of all the data sets is in Table 

For each data set, we construct a sequence of unlabeled patterns as follows: 
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1. Generate a reversible Markov chain {yn}n=i in {1,, k}, where k denotes 
the class number of the data set, and the transition matrix P = [Pij] = 
[Pr(j/„+i = j|y„ = z)] e of {y„} is given by 



0.97 0.03 





0.03 0.97 





0.97 

0.015 

0.015 ■ 



0.025 

0.95 

0.025 



0.02 

0.02 

0.96 



0.9517 

0.0198 

0.0138 

0.0147 

0.0198 

0.9509 

0.0134 

0.0159 

0.0138 

0.0134 

0.9535 

0.0193 

0.0147 

0.0159 

0.0193 

0.9501 



for K = 2, 3,4 repsectively. 

2. For every n = 1,..., N, randomly select a pattern x„ with class label 
from the data set without repetition. (This step cannot be implemented if 
the element number of {n|y„ = i} is bigger than the data size of the z-th 
class in the data set for some z. For such a case, we will repeat generating 
{ijn} until this step is feasible.) 

Since the self-transition probabilities in transition matrices P (i.e., the diagonal 
elements of P) are all close to 1, a pattern sequence {x„} generated as above can 
be viewed as a metastable processes with each class being a metastable state. 
Then we perform clustering of {x„} by metastable state decomposition after 
removing the class labels of {x„}. Finally, the clustering accuracies of different 
methods are evaluated by calculating classification errors on the training data 
set {x„} and the testing data set consisting of all instances not in {x„}: 


61 ’^train 


errtest 


1 

N 


N 

n—1 


1 

[{testing data}! 


E 


l!/(x)#y(x) 


xG{testing data} 


(24) 


where y (x) denotes the true class label of x, and y (x) denotes the predicted 
label given by metastable state decomposition results. 

Table [^summarizes clustering errors on the various data sets with N = 1000. 
It can be observed from the table that both PCCA+ and M^C outperform the 
geometric clustering methods, fc-medoids and MMC, by utilizing the metastable 
structure in the sequential, and M^C achieves the best clustering performance 
among all the four methods. Moreover, considering that the clustering given by 
PCCA+ depend on the space discretization results (see Section [^ , we report 
the clustering errors of PCCA+ with different numbers of discrete bins in this 
table. As can be seen, for most data sets, except segment, either the finest 
discretization (with 400 bins) or the coarsest discretization (with 50 bins) cannot 
lead to the best PCCA+ clustering results, which demonstrates the sensitivity 
of PCCA+ to the choice of space discretization. 









6.2 Diffusion models 


17 


Table 1: Summary of the data sets 



size 

pattern dimension 

number of classes^-^(«;) 

letter 

1555 

17 

2 

satellite 

2236 

36 

2 

spambase 

4601 

57 

2 

waveform 

5000 

21 

3 

segment 

1320 

19 

4 


For data sets which contain more than k classes, we only use 
their subsets consising of the first k classes. 


Table 2: Means and standard deviations of clustering errors (in percent) calcu¬ 
lated over 20 independent experiments of various data sets 



A;-iiiedoids 

MMC 

PCCA+ 

(50 bins) 

PCCA 
(200 bins) 

PCCA+ 

(300 bins) 

PCCA^ 

(400 bins) 

M^C 

letter 

errtraln 

8.6306 ± 3.0850 

7.0090 ±3.2752 

2.2450 ±1.0128 

0.5450 ±0.1849 

0.4450 ±0.2585 

0.5950 ±0.2837 

0.1100 ±0.0852 

errtest 

8.6400 ± 3.0025 

7.3600 ±3.6059 

2.7207 ±1.4,567 

1.1171±0.5746 

0.7658 ±0.5871 

1.0270 ±0.6299 

0.2342 ±0.1963 

satellite 

errtrain 

2.3503 ±0.8560 

2.3058 ± 0.8616 

0.7727 ± 0.2905 

0.6600 ± 0.2741 

0.7600 ±0.2604 

0.8600 ±0.2501 

0.0900 ±0.0912 

errtest 

6.7950 ± 1.3069 

6.6450 ±1.4365 

1.1950 ±0.3348 

0.7727 ±0.3457 

0.9102 ±0.3803 

1.0720 ±0.5436 

0.3196 ±0.1325 

spambase 

errtrain 

36.8773 ±1.8360 

22.5923 ±9.7192 

23.6000 ±9.2388 

25.7450 ±13.4768 

22.7950 ±12.4952 

28.3671 ±8.2107 

8.5296 ±1.2141 

errtest 

45.0150 ± 4.2245 

23.7650 ±7.4026 

24.7362 ±7.8620 

27.1355 ±9.7611 

23.9572 ± 8.3935 

30.2900 ±15.3507 

12.2350 ±1.0338 

waveform 

errtraln 

46.6212 ±2.8506 

34.1600 ±16.3716 

23.5425 ±2.6599 

21.5525 ±2.9225 

26.1075 ±8.1388 

30.3725 ±13.9271 

15.0550 ±7.1295 

errtest 

47.4550 ±2.0623 

37.4900 ±16.0950 

27.2483 ±3.7676 

27.6700 ±4.5708 

33.6117 ±9.91157 

36.54,50 ±13.2638 

22.2675 ±10.4104 

segment 

errtrain 

21.2417 ± 11.3240 

8.1050 ±3.9986 

5.4500 ±1.0273 

2.6050 ±0.4729 

2.4850 ±0.8845 

2.3400 ±0.6652 

1.3750 ±0.3059 

errtest 

31.9750 ±7.9013 

18.3234 ±18.1635 

14.9662 ±12.2071 

10.2918 ±12.3009 

10.5916 ±12.2503 

9.1230 ±9.8576 

2.5392 ±1.6581 


6.2. Diffusion models 

We now consider two examples of time-reversible diffusion processes, denoted 
by Model I and Model II, which can be described by two dimensional Fokker- 
Planck equations (see Appendix E for details). 

For metastable state decomposition of Model I, we generate 10 trajecto¬ 
ries by 10 independent simulations with time length 80, sample interval At = 
0.2 and initial states Xq distributed according to an uniform distribution on 
[-1.5,1.5]^ Fig .|^shows the potential energy and simulation trajectories, where 
Xt = (xj^^xp^) denotes the system state at time t, the potential function V (x) 
is defined by tt (x) ex exp (—E (x)) and tt (x) is the equilibrium distribution 
limi_>oop(xt = x). We can observe that Model I has 6 potential wells, and the 
energy barrier between the “upper” wells (with > 0) and the “lower” ones 
(with < 0) can be easily crossed. Therefore, the phase space of Model I 
can be decomposed into 3 metastable states with each one containing an up¬ 
per and a lower potential well. Fig. displays the decomposition results of 
all four methods with k = 3 by using the trajectory data shown in Fig. |3b| 
where the bin number of PCCA+ is set to be 10. It is obvious that M^C and 
PCCA+ accurately identify the three metastable states, while the macrostates 
given by fc-medoids and MMC do not exhibit strong metastability although the 
decomposition results of the latter two methods are “reasonable” in the sense of 
geometric clustering. This shows the limitation of geometric clustering methods 
in the case that the spatial structure of metastable states does not only depend 
on the shape of equilibrium state distribution function. 

Note that the decomposition result displayed in Fig. 1^ might not be the 
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1.5 



x(i) xd) 

(a) Potential function (b) Trajectory data, where dotted lines repre¬ 


sent trajectories of xt and circles denote data 
points sampled from trajectories. 


Figure 3: Illustration of Model I. 


global optima of the MMC problem. It is natural to ask if MMC can correctly 
find the metastable states by choosing a better initial solution for the local 
search procedure. In order to answer this question, we solve the MMC problem 
by the local search algorithm in |43| starting from the decomposition provided 
by M^C . Fig. § shows the corresponding decomposition result, which is very 
similar to that obtained by M^C and PCCA+. However, the objective function 
value of this decomposition in the MMC problem is 0.0652, which is larger 
than the objective function value 0.0496 of the decomposition shown in Fig. 
This indicates that MMC prefer the decomposition in Fig. |^to that in Fig. [5 
although the latter one is better for metastability analysis. 


Moreover, we also perform the local search algorithm to solve the M^C prob¬ 
lem with random initialization, and Fig. plots the optimization result. As 
observed from the figure, the randomly generated initial solution leads to the 
algorithm getting stuck in a local optimum. (The optimal objective function 
values of the M^C problem obtained by the local search with random initial¬ 
ization and the mixed-algorithm proposed in Section [53| are 0.1553 and 0.0967 
separately.) It shows that the local search algorithm proposed in Section 5.2 


is sensitive to initial conditions, and some heuristic method (e.g., the global 
search algorithm presented in this paper) is needed to provide a satisfactory 
initial solution. 


The potential energy of Model II is shown in Fig. We generate 50 tra¬ 
jectories in the state space of Model II in order to test the metastable state 
decomposition methods (see Fig|7b[), where the time length of each trajectory is 
1, the sample interval is At = 0.02 and the initial states are uniformly sampled 
from [—2, 2]^. It is easy to see that Model II has three metastable states formed 
by the three potential wells. 


show that fc-medoids and MMC fail to identify the three metastable states in 


Fig-i summarizes the decomposition results with k = 3. Figs, and ^ 
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3,(1) 3.(1) 

(c) MMC (d) PCCA+ (with 10 bins) 


Figure 4: Decomposition results of Model I, where white lines represent bound¬ 
aries between macrostates. The boundaries are computed by the finite element 
method with mesh size 0.05 x 0.05. 





Figure 5: Decomposition result of Model I obtained by the local search based 
MMC with initial solution given by M^C . 
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1.5 
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0.5 

I 0 

-0.5 
-1 

-^ 4 . .. . .. 

Figure 6: Decomposition result of Model I obtained by the local search based 
M^C with random initialization. 



Model II due to the large difference between the empirical data distribution and 
the equilibrium distribution. It is interesting to see from Fig. that PCCA+ 
(with 20 discrete bins) also gives an undesired decomposition. We now analyze 
why PCCA+ fails in this example. Note that there is a low energy barrier at 
the center of the right potential well (see the rectangular region with the dashed 
line in Fig. and its magnified picture shown in Fig. |^, which can be easily 
crossed for equilibrium simulations. But the trajectory data used in the above 
is very far away from the global equilibrium due to short simulation lengths. 
So only a small number of transitions between discrete bins of PCCA+ are 
observed around this region, and the PCCA+ assigns these bins to different 
metastable states. In all the four methods, M^C is the only method that get 
correct metastable state decomposition by utilizing both the geometric and the 
dynamical information, and it avoids splitting the right potential well into two 
metastable states as PCCA+ because such a decomposition leads to a small 
margin between metastable boundaries. 

Finally, we repeat the above numerical experiments of Model I and Model 
II for 20 times and utilize the following quantities to evaluate and compare the 
performance of different decomposition methods quantitatively: 

1- Q = Sfe=i Pkk (At), where At denotes the sample interval and 
('^) = 1™ Pr(xt+r G metastable state jjxt G 

t—>-oo 

metastable state i) (25) 

denotes the transition probability between metastable states with lagtime 
r. It is clear that Q can measure the metastability of a system and a 
decomposition with strongly metastablity will result in a large Q close to 
K [32]. 

2. Implied timescale ITSi (r) = — t/ In Ai(T) with i > 1, where Ai (t) de¬ 
notes the i-th largest eigenvalue of transition probability matrix P (r) = 
[Pij (r)]. (The first implied timescale ITSi (t) = oo) It can be proved 
that the value of ITSi (t) is a constant independent of r and equal to the 
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(a) Potential function 


2 

1 




(b) Trajectory data, where dotted lines repre¬ 
sent trajectories of xt and circles denote data 
points sampled from trajectories. 


Figure 7: Illustration of Model II, where the rectangular with dashed lines 
shows the region [1.4,1.65] x [—0.5,0.5] of a low energy barrier within the right 
potential well. 


dominant relaxation timescales of the original system if the transitions 
between metastable states are exactly Markovian |12l ITH]. Thus, we can 
check if a Markov chain on metastable states can accurately approximate 
the system dynamics through comparing implied timescales with different 

T. 

Remark 11. We run a long simulation with time length 10^ to estimate values of 
Pij (t) for each model. Furthermore, for convenience of comparison, we use the 
finely discretized Markov state models with 50 states to estimate the true relax¬ 
ation timescales of Model I and Model II. The detailed estimation algorithms 
are given in jS]. 

Table [^ and Figs. [^ and El summarize the values of Q and implied timescales 
given by different decomposition methods, including the PCCA+ method with 
different bin numbers. The table and figures also demonstrate the superior 
performance of M^C. It is worth pointing out that in 9 out of 20 experiments 
of Model II, PCCA+ wrongly decomposes the right potential well into two 
metastable states (with any bin number), whereas M^C gives the “ideal” decom¬ 
position in all the 20 experiments. Moreover, as can be seen from the figures, 
the implied timescales obtained from M^C converge fast and are very close to 
the relaxation timescales estimated by finely discretized Markov state models, 
which implies that the essential dynamical properties of Model I and Model 
II can be accurately captured by 3-state Markov models using the metastable 
states identified by M^C. 

6.3. Molecular dynamics simulations 

We consider in this section the metastable state decomposition problem of 
molecular dynamics simulation models of alanine dipeptide and deca-alanine. 
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(c) MMC (d) PCCA+ (with 20 bins) 


Figure 8: Decomposition results of Model II, where white lines represent bound¬ 
aries between macrostates. The boundaries are computed by the finite element 
method with mesh size 0.05 x 0.05. 





Figure 9: PCCA+ decomposition result of Model II in the area [1.4,1.65] x 
[—0.5, 0.5], where where dotted lines represent trajectories, circles denote data 
points sampled from trajectories, solid lines show boundaries of discrete bins 
obtained by space discretization, and the upper boundary is chosen to be the 
metastable state boundary by the PCCA+ algorithm. 
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Table 3: Means and standard deviations of Q values calculated over 20 inde¬ 
pendent experiments of Models I and II 



t-medoids 

MMC 

PCCA 

(6 bins) 

PCCAi 
(10 bins) 

PCCA 
(20 bins) 

PCCAi 
(30 bins) 

PCCA- 

(40 bins) 

M-'^C 

Model I 

2.8138 ±0.0317 

2.8142 ±0.0244 

2.9603 ±0.0170 

2.9637± 0.0011 

2.9628 ±0.0017 2.9619 ± 0.0027 

2.9613 ±0.0021 

2.9641 ±0.0005 

Model II 

2.9579 ±0.0031 

2.9832±0.0116 

2.9681 ±0.0211 

2.9666 ±0.0161 

2.9676 ±0.0173 2.9655 ± 0.0187 

2.9600 ±0.0285 

2.9882 ±0.0024 


40 


^30 


M 20 


S 10 

a; 


— fc-medoids 
-b-MMC 

PCCA+ (6 bins) 
^PCCA+ 10 bins) 
^PCCA+ (20 bins) 
-^PCCA+ (30 bins) 
-S-PCCA+ (40 bins) 
.MSM _ 


2 4 

lagtime r 

(a) 


12 


^10 


T3 8 

CD ° 


S 6 

CO 

4 

o 

P 

44 <4 

CO 



^-M^C 

-4“ fc-medoids 

-b-MMC 

PCCA+ (6 bins) 
-4^PCCA+ (lO bins) 
PCCA+ (20 bins) 
-»-PCCA+ (30 bins) 
-<-PCCA+ (40 bins) 


mm 


2 4 

lagtime r 

(b) 


25r 


H 


20 


CD 

5 15 


T-3 

^ 4 r\ 

CD 10 

«-l-^ 

O 

§ 5- 
a; 


—fc-medoids 
-b-MMC 

PCCA+ (6 bins) 
^PCCA+ (10 bins) 
-5-PCCA+ (20 bins) 
-^PCCA+ 30 bins) 
^PCCA+ (40 bins) 
.MSM _ 

.. I mn it nnn*— 


2 4 

lagtime r 

(c) 


M^C 

fe-medoids 

MMC 

PCCA+ (6 bins) 
PCCA+ (10 bins) 
PCCA+ (20 bins) 
PCCA+ 30 bins) 
PCCA+ (40 bins) 



Figure 10: Means and standard deviations of estimated implied timescales of 
Model I obtained by different decomposition methods, where dotted lines indi¬ 
cate estimates of the second and third relaxation timescales of Model I computed 
by 50-state Markov state models. 
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Figure 11: Means and standard deviations of estimated implied timescales of 
Model II obtained by different decomposition methods, where dotted lines in¬ 
dicate estimates of the second and third relaxation timescales of Model II com¬ 
puted by 50-state Markov state models. (The implied timescale value is set to 
be zero if the corresponding eigenvalue of the transition probability matrix is 
zero or negative.) 
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Figure 12: Illustration of the structure of alanine dipeptide 


Alanine dipeptide (sequence acetyl-alanine-methylamide) is a small molecule 
which consists of two alanine amino acid units. The structural and dynamical 
properties of this molecule have been thoroughly studied, and its conformation 
space (phase space) can be conveniently described by two backbone dihedral 
angles ip and t/j (see Fig. 12). Deca-alanine is a small peptide composed of 
10 alanine residues, and its configuration can be described by 18 backbone 
dihedral angles. We perform twenty simulations of 200ns of molecular dynamics 
of alanine dipeptide with sample interval 20ps and six 500ns molecular dynamics 
simulations of deca-alanine with sample interval lOOps (The detailed simulation 
model is given in ESI). 


The metastable state decomposition methods are applied to each simulation 
trajectory (k = 3 for alanine dipeptide and k = 2 for deca-alanine), and the 
corresponding Q values and implied timescales are calculated from all the other 
trajectories of the same molecule. Moreover, considering the periodicity of an¬ 
gular data, we represent the molecular state as a vector x consisting of sin/cos 
of the dihedral angles in experiments. 

Fig. |13a| shows the potential function of alanine dipeptide in the space of 
(if, 'ijj) and the three metastable states which can be manually identified accord¬ 
ing to experience, and data points sampled from one simulation trajectory are 
displayed in Fig. |13b[ The decomposition results of the simulation trajectory 
are plotted in Fig. |14| As can be seen, Ic-medoids and MMC fail to indentify 
the metastable structure of alanine dipeptide, and the decompositions obtained 
by PCCA+ and M^C are consistent with the manual decomposition. Table 
and Fig. [^summarize Q values and implied timescales given by decomposition 
results of simulation trajectories of alanine dipeptide. It can be observed that 
PCCA+ and M^C performs significantly better than the geometric clustering 
methods fc-medoids and MMC. M^C achieves the similar average Q values and 
implied timescales as PCCA+, but the corresponding standard deviations of 
M^C are lower than that of PCCA+, which shows M^C has more stable perfor¬ 
mance for this molecular dynamics simulation model. 


The decomposition results of deca-alanine are displayed in Table and 


Fig. 16 The Q values obtained by M^C are significantly smaller than that 
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-3 -2-1 0 1 


(a) Potential function estimated from simu¬ 
lation trajectories, where dashed lines repre¬ 
sent boundaries between 3 metastable states 
which are manually identified. 



-3 -2-1 0 1 


‘P 

(b) Data points sampled from a simulation 
trajectory with sample interval At = 20ps. 


Figure 13: Illustration of the molecular dynamics simulation of alanine dipetide 


Table 4: Means and standard deviations of Q values calculated over 10 inde¬ 
pendent experiments of alanine dipeptide and 6 independent experiments of 
deca-alanine 



^:-medoids | 

MMC 

PCCA-f 1 

(100 bins) 

PCCA^ 1 

(200 bins) 

PCCA-f 1 

(300 bins) | 

PCCA-f 1 

(400 bins) 

PCCA4 1 

(500 bins) | 

M^C 

1 Hlniiine dipeptide I 

1.7655 ±0.0027 I 2.1521 ± 0.5060 I 

2.7360 ±0.0077 | 

2.7381 ±0.0018 1 

2.7367 ±0.0059 | 

2.7340 ±0.0037 | 

2.7328 ± 0.0052 | 

2.7397 ±0.0003 | 

dix'a-alanino 

1.8113 ±0.0502 1 

1.8819 ±0.0454 | 

1.9033 ±0.0292 

1.8904 ±0.0402 

1.8894 ±0.0815 

1.9227 ±0.0238 

1.8896 ±0.0764 

1.9592 ±0.0038 


given by the other methods. In contrast to alanine dipeptide, the kinetics of 
deca-alanine is much more complicated and it is difficult to accurately estimate 
the relaxation timescales. In |53], a lower bound 6.5ns for the second relax¬ 
ation timescale is given. Moreover, according to the variational principle m, 
the second implied timescale obtained from a metastable state decomposition 
is always smaller than the true one if there is no statistical noise. So we can 
conclude from Fig. 16 that M^C gives more accurate estimate of ITS 2 than the 
other methods for this molecular dynamics model. 


7. Conclusion 

Large margin methods have turned out to be an effective and robust ap¬ 
proach for supervised and unsupervised learning problems. In this paper, we 
apply the large margin principle to the metastable state decomposition problem, 
and propose a maximum margin metastable clustering (M^C) method to iden¬ 
tify metastable states of complex stochastic systems. The key step is to design a 
large margin metastable constraint by combining the metastability criterion 
and large margin criterion, where we assign a class label to each transition pair 
in trajectories instead of a single data point. Then the error of metastable state 
decomposition can be expressed by the misclassification loss function in large 
margin learning, and a lot of well developed computational techniques for large 














(b) fc-medoids 



(c) MMC 


(d) PCCA+ (with 200 bins) 


Figure 14: Decomposition results of the simulation model of alanine dipetide, 
where black lines represent boundaries between macrostates. The boundaries 
are computed by the finite element method with mesh size O.OStt x O.OStt. 
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Figure 15: Means and standard deviations of estimated implied timescales of 
alanine dipeptide obtained by different decomposition methods, where dotted 
lines indicate estimates of the second and third relaxation timescales of alanine 
dipeptide computed by 500-state Markov state models. 
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Figure 16: Means and standard deviations of estimated implied timescales of 
deca-alanine obtained by different decomposition methods. 


margin learning, such as kernel based feature mapping and convex relaxation, 
can be utilized. Moreover, we present a hybrid optimization algorithm which 
mixes global search and local search strategies to solve M^C problems with 
large-scale data sets. In contrast to previous metastable state decomposition 
methods including geometric clustering methods and kinetic clustering meth¬ 
ods, the M^C method can effectively utilize both the geometric information and 
the dynamical information provided by trajectories without pre-discretization 
in space, and our experimental analysis reveal that the M^C method yields 
more accurate and robust decomposition results than traditional geometric and 
kinetics clustering methods in most cases. 

The major drawback of M^C is that the computing burden will be heavy for 
very large data set, because it need to iteratively solve an SVM-like problem. 
In the future, we will use some modern SVM techniques such as Pegasos[58] 
and core vector machine |59| to improve the efficiency of M^C. Moreover, we 
will investigate how to extend our method to the problem of slow process de¬ 
composition of metastable systems imisn] by incorporating the distance metric 
learning technique El so that it can not only detect metastable states but also 
extract dominant dynamical features from simulation and experimental data. 
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Appendix A. Proof of equivalence between (|^ and 


Suppose first that (12) holds. Substituting {k,k) = {k,yn) and (fc,fc) = 


{yn,k) into (12), we can get 


“ ^L) ^ + {^y^vr. - hk) + ly„=fe=fe 

= - ^l) ^ (^") + - bk) + ly„=fc 


(A.l) 


> 1 


and 


('^Lyr. “ ^L) ^ + (^y-^y^ “ hk) + ly„=fe=fe 

= ~ ’^fc) ^ + (^y- “ h) + lj/„=fc 


(A.2) 


> 1 


So (121 is a sufficient condition for d^. 

We now show the necessity of (12) for ([^. Substituting k = k and k = k 
into the two inequalities in ([^, respectively, yields 


( 


w' — Wr 


'Vnyn “f {^y„=k + ly„=fc l) ^ 1 (-^-3) 

Note that 


1, yn = k_= k 

^yji—k~^^yn—k ^ {kjk'\jk ^ k 

— 1, otherwise 


and 


Therefore, 


^y^—h — k 


1, yn = k = k 
0, otherwise 


(A.4) 

(A.5) 


(’^Ly. “ ^L) ^ + (by„y„ “ &fcfe) + 1 


y^ — k = k 


^ {^hyr. - ^Ik) ^ “ h^ + (ly„=fe + ly„=fc - 1) 

> 1 


From the above, we can conclude that (121 is equivalent to ([^ . 

Appendix B. Proof of Theorem 


(A.6) 


Let us start with the case that all labels y° in (14) are given. In this case, the 
coarse-grained M^C problem (141 is reduced to a simple quadratic programming 
problem: 


min |jW||^-f c'''|'= 

s.t. Vn=l,...,A^, Vfc, fc = 1,..., K, 

('^ytyt “ ^ + lys=fc=fc ^ ^ 


(B.l) 
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For the sake of convenience, here we let denote a k dimensional vector with 
only the k-th element being 1 and others 0, and define a matrix B 


B 


ei 

62 • 

• Gk 6i 6i 


B 


ei 

62 • 

• 62 63 • 

' — 1 


(B.2) 


i.e., each column of B is an element of {(el,ej)‘''|i, j € [1 ,k]} and the first k 
columns of B is equal to [ I I ]^, where B,B G are two submatrices 

consisting of the first k and the last k rows of B. 

By introducing a dual variable A € 

Lagrangian of (B.l) can be written as 




and using (151 and (B.21, the 


c = i/3||W||VcT|"-tr(DTdiag(Al) (X + X) WT) 

-tr([ ATD GOT ])+tr(AT (XWTB + XWTB)) 

+ (1-|")TA1 (B.3) 


where X = (</>(x5),..., S 


pN^xd 


-xd 


and X = e 

Setting the derivatives of the Lagrangian (B.31 with respect to W and 
to zero, and adding the constraint A > 0, we obtain the following dual 
problem of (B.l): 


nmx ~^tr ([BAT BAt]K[BAt bat 
+ itr (DTC [I I ] K [ BAT bAt Y) 

-tr ([ AtD got ]) - ^tr (MCK'*C) + ITc 

s.t. A1 = 1, 

A > 0. 

where 

r ■ 

^ ^ k(2.2) 

K® = 

and 


(B.4) 


(B.5) 

(B.6) 


= XTX, ^ xTx, = X'^X, = X'^X (B.7) 


Putting 

have 


(B.41 into the form of a standard quadratic programming problem, we 


max — ^vec (A)"'^ Pvec (A) + qTvec (A) — ^tr (MCK®C) + ITc 
s.t. (IT (g) I) vec (A) = 1, (B-8) 

vec (A) > 0. 


where 


P = (BTB) (g) + (BTB) (g) 

+ (B'^B) (g)K(24) + (btb) (gK(2’2) (B.9) 
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and 


q = vec[ — 
1 


Q +K(2'i)y C^DB 

+ i +K(2'2)y c^DB-[ D OQT 


(B.IO) 

(The definition of vec (•) is given in the list of notation.) According to Lemma 


12 and the strong duality theorem [55], we can conclude that (B.l I has the same 


minimum with the following optimization problem: 

min i6/T6»-cTQ;-;Atr(MCK'’C) + lTc 

a,v,e ^ '■ 

s.t. q + (1 0 I) a + V + R0 = 0, 

V > 0. 

where R is a full column rank matrix satisfying 

ip = RRT 

(Note that R may not be a square matrix if P is not full rank.) 

Combining the equivalence between (B.l I and (B.lll and the proposition 
mentioned in Remark]^ the theorem is proved. 

Lemma 12. For a quadratic programming problem defined by 

min Ax + b^x 


(B.ll) 


(B.12) 


s.t. Ex = c, 

X > 0. 


(B.13) 

if A can be decomposed as A = RR’’’ with R being full column rank and there 
is an optimal solution to (B.131, then the minimum of (B.131 is equal to the 
maximum of the following problem: 

1 


max —fO'^O + c'^a 
a.v,e ^ 


s.t. 


bT-aTE-vT 
V > 0. 


Proof. The Lagrangian of ( |B.13 ) is 

1 


C (x, ot, v) = -x'''Ax + (b’’’ — cPFi — v’’’) X + a 


(B.14) 


(B.15) 


Then the dual problem of (B.131 can be written as 


with 


max 

5(a,v) 


q:,v 

(B.16) 

S.t. 

V > 0. 

,v) = 

inf C (x, ct, v) 

X 

(B.17) 

a,v) 

in different cases. 













33 


Algorithm 2 Optimization procedure for MMC 
1: generate a coarse-grained set with cardinality N‘^ and normalized weights 
c = (ci,..., catc)^ from S by the fc-medoids algorithm 
2: solve the MMC problem on S'^ by the SDP relaxation algorithm m to get 
class labels y'= = (yf,..., y%a) of 

3: calculate class labels = (yi,..., y| 5 |) of data points in S from y‘^ 

4: solve the MMC problem on S by the local search algorithm proposed in |33] 
starting from y = 


Algorithm 3 Optimization procedure for PCCA+ 

1: partition all the data into N'^ bins x^,..., x^c by the fc-medoids algorithm 
2: estimate the transition matrix P = [Pij] = [Pr(xt+At € x^|xt_|_At G x^)] by 
the maximum likelihood algorithm in |7] 

3: apply the Markov compression algorithm in |26| to lump the N‘^ bins into 
K metastable state. 


Case (i) There is a 0 such that 

bT -CcTE- vT =6/TRT 


(B.18) 


We have C (x, ct, v) = ^ (R^x)^ + 9'^ (R^x) + c^a and g (a, v) = 

-kO'^e + c'^a. 


Case (ii) There is no 6 satisfying (B.18 1 . It is easy to see that we can find an 
X such that R^x = 0 and (bT — oPPi — v‘'')x ^ 0. Then g (q:,v) = —oo. 


Combining the above results yields the conclusion of the lemma. 


□ 


Appendix C. Optimization procedure for MMC 

The optimization algorithm for solving MMC problems in our experiments 
is described by Algorithm It can be seen that this algorithm also combines 
global search and local search techniques through coarse graining like the algo¬ 
rithm for M^C proposed in this paper. 

Appendix D. Implementation procedure of PCCA-(- 

In this paper, we perform the PCCA+ clustering as shown in Algorithm]^ 

Appendix E. Description of Model I and Model II 

Model I is governed by the Fokker-Planck equation 



r 1 a 1 


V2 

dxj = - 

4 

l J 

Ui (x) dt + 

- 1 

ICM 

> 

CO 

_1 


dxj = 


(E.l) 
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where Xt = ) denotes the system state at time t, Ut is a two-dimensional 

Wiener process, and 


C/i (x) = —8 ^ exp ( — 8(x^^^—/ri 

(/ii,/i2)e{-i,o,i}x{-i,i} 


(- 8 (= 


-200 

4 


r(2) 




The stochastic differential equation of Model II is 


dx* = - - 


d 


a 

dx(^^ - 


Uu (x) dt 


-dut 

7 


with 


(E.2) 


(E.3) 


Unix) = -dyexp ^-16 ^(||x|| - 1.6)^-hmax|6>-|,o| 
-47 exp (^-0.8 (x^i) 1 )^ - 32 (x^^^ - 

-47 exp (^-0.8 (x^i) -h 1 ) ^ - 32 (x^^) O.s) 


(E.4) 


6 = atan2 (x*^^\ x^^^) and 7 = 1.67. It is easy to verify that Model I and Model 
II are time-reversible diffusion processes, and their equilibrium distributions are 
TT (x) oc exp {—Ui (x)) and tt (x) oc exp {—Un (x)) separately. 

Moreover, we utilize the Euler-Maruyama method [53] to solve (E.ll and 
(E.21 in this paper. 


References 

[1] H. Wu, Maximum margin clustering for state decomposition of metastable 
systems, in: I. Rojas, G. Joya, J. Cabestany (Eds.), Proceedings of the 12th 
International Work-Conference on Artificial Neural Networks, Vol. 7902 of 
Lecture Notes in Computer Science, Springer, Tenerife, Spain, 2013, pp. 
556-565. 

[2] F. Noe, S. Fischer, Transition networks for modeling the kinetics of confor¬ 
mational change in macromolecules. Current opinion in structural biology 
18 (2) (2008) 154-162. 

[3] T. Biancalani, T. Rogers, A. McKane, Noise-induced metastability in bio¬ 
chemical networks. Physical Review E 86 (1) (2012) 010106. 








REFERENCES 


35 


[4] N. Berglund, B. Gentz, Metastability in simple climate models: Pathwise 
analysis of slowly driven langevin equations, Stochastics and Dynamics 
2 (03) (2002) 327-356. 

[5] F. Noe, 1. Horenko, C. Schiitte, J. Smith, Hierarchical analysis of conforma¬ 
tional dynamics in biomolecules: Transition networks of metastable states. 
Journal of Chemical Physics 126 (2007) 155102. 

[6] C. R. Schwantes, V. S. Pande, Improvements in markov state model con¬ 
struction reveal many non-native interactions in the folding of ntl9. Journal 
of chemical theory and computation 9 (4) (2013) 2000-2009. 

[7] J. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. Chodera, 
C. Schiitte, F. Noe, Markov models of molecular kinetics: Generation and 
validation. Journal of Chemical Physics 134 (2011) 174105. 

[8] R. Aldhaheri, H. Khalil, Aggregation and optimal control of nearly com¬ 
pletely decomposable Markov chains, in: Proceedings of the 28th IEEE 
Conference on Decision and Control, IEEE, 1989, pp. 1277-1282. 

[9] J. Chodera, W. Swope, J. Pitera, K. Dill, Long-time protein folding dynam¬ 
ics from short-time molecular dynamics simulations. Multiscale Modeling 
& Simulation 5 (4) (2006) 1214-1226. 

[10] M. Sarich, F. Noe, C. Schiitte, On the approximation quality of Markov 
state models, SIAM Multiscale Model. Simul. 8 (4) (2010) 1154-1177. 

[11] J. D. Chodera, F. Noe, Markov state models of biomolecular conformational 
dynamics. Current Opinion in Structural Biology 25 (2014) 135-144. 

[12] F. Noe, H. Wu, J.-H. Prinz, N. Plattner, Projected and hidden markov 
models for calculating kinetics and metastable states of complex molecules. 
The Journal of chemical physics 139 (18) (2013) 184114. 

[13] N. Groningen, Essential dynamics of reversible peptide folding: memory- 
free conformational dynamics governed by internal hydrogen bonds. Jour¬ 
nal of Molecular Biology 309 (1) (2001) 299-313. 

[14] W. Swope, J. Pitera, F. Suits, M. Pitman, M. Eleftheriou, B. Fitch, R. Ger¬ 
main, A. Rayshubski, T. Ward, Y. Zhestkov, R. Zhou, Describing protein 
folding kinetics by molecular dynamics simulations. 2. example applications 
to alanine dipeptide and a /3-hairpin peptide. Journal of Physical Chemistry 
B 108 (21) (2004) 6582-6594. 

[15] E. Sorin, V. Pande, Exploring the helix-coil transition via all-atom equilib¬ 
rium ensemble simulations. Biophysical Journal 88 (4) (2005) 2472-2493. 

[16] S. Elmer, S. Park, V. Pande, Foldamer dynamics expressed via markov 
state models. H. state space decomposition, Journal of chemical physics 
123 (2005) 114903. 



REFERENCES 


36 


[17] O. M. Becker, Geometric versus topological clustering: an insight into 
conformation mapping, Proteins: Structure, Function, and Bioinformat¬ 
ics 27 (2) (1997) 213-226. 

[18] X. Daura, W. F. van Gunsteren, A. E. Mark, Folding-unfolding thermody¬ 
namics of a /3-heptapeptide from equilibrium simulations. Proteins: struc¬ 
ture, function, and bioinformatics 34 (3) (1999) 269-280. 

[19] D. Ghema, A. Goldblum, The “nearest single neighbor” method finding 
families of conformations within a sample. Journal of chemical information 
and computer sciences 43 (1) (2003) 208-217. 

[20] A. Glattli, D. Seebach, W. F. van Gunsteren, Do valine side chains have an 
influence on the folding behavior of /3-substituted /3-peptides?, Helvetica 
chimica acta 87 (10) (2004) 2487-2506. 

[21] J. Shao, S. W. Tanner, N. Thompson, T. E. Gheatham, Glustering molec¬ 
ular dynamics trajectories: 1. characterizing the performance of different 
clustering algorithms. Journal of Ghemical Theory and Gomputation 3 (6) 
(2007) 2312-2334. 

[22] Y. Yao, J. Sun, X. Huang, G. R. Bowman, G. Singh, M. Lesnick, L. J. 
Guibas, V. S. Pande, G. Garlsson, Topological methods for exploring low- 
density states in biomolecular folding pathways, Journal of chemical physics 
130 (2009) 144115. 

[23] B. Keller, X. Daura, W. F. van Gunsteren, Gomparing geometric and ki¬ 
netic cluster algorithms for molecular simulation data. Journal of chemical 
physics 132 (7) (2010) 074110. 

[24] F. Noe, G. Schiitte, E. Vanden-Eijnden, L. Reich, T. R. Weikl, Gonstructing 
the equilibrium ensemble of folding pathways from short off-equilibrium 
simulations. Proceedings of the National Academy of Sciences 106 (45) 
(2009) 19011-19016. 

[25] P. Deuflhard, W. Huisinga, A. Fischer, G. Schiitte, Identification of almost 
invariant aggregates in reversible nearly uncoupled markov chains. Linear 
Algebra and its Applications 315 (1) (2000) 39-59. 

[26] P. Deuflhard, M. Weber, Robust perron cluster analysis in conformation 
dynamics. Linear algebra and its applications 398 (2005) 161-184. 

[27] V. Mehrmann, D. Szyld, E. Virnik, An SVD approach to identifying 
metastable states of Markov chains. Electronic Transactions on Numeri¬ 
cal Analysis 29 (2008) 46-69. 

[28] A. Jain, G. Stock, Identifying metastable states of folding proteins. Journal 
of Ghemical Theory and Gomputation 8 (10). 



REFERENCES 


37 


[29] G. R. Bowman, Improved coarse-graining of markov state models via ex¬ 
plicit consideration of statistical uncertainty, Journal of chemical physics 
137 (2012) 134111. 

[30] E. H. Kellogg, O. F. Lange, D. Baker, Evaluation and optimization of 
discrete state models of protein folding. Journal of Physical Chemistry B 
116 (37) (2012) 11405-11413. 

[31] R. T. McGibbon, C. R. Schwantes, V. S. Pande, Statistical model selection 
for markov models of biomolecular dynamics. Journal of Physical Chem¬ 
istry B. 

[32] J. Chodera, N. Singhal, V. Pande, K. Dill, W. Swope, Automatic discovery 
of metastable states for the construction of markov models of macromolec- 
ular conformational dynamics. Journal of Chemical Physics 126 (2007) 
155101. 

[33] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass 
kernel-based vector machines. Journal of Machine Learning Research 2 
(2001) 265-292. 

[34] L. Xu, Convex large margin training techniques: Unsupervised, semi- 
supervised, and robust support vector machines, Ph.D. thesis. University 
of Waterloo, Waterloo, Ontario, Canada (2007). 

[35] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998. 

[36] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, 1. Guyon, L. D. Jackel, 
Y. LeCun, U. A. Muller, E. Sackinger, P. Simard, V. Vapnik, Comparison 
of classifier methods: a case study in handwriting digit recognition, in: 
Proceedings of the 12th International Conference on Pattern Recognition, 
Vol. 2, IEEE, IEEE Computer Society Press, 1994, pp. 77-82. 

[37] J. Friedman, Another approach to polychotomous classifcation. Tech, rep.. 
Department of Statistics, Stanford University (1996). 

[38] E. L. Allwein, R. E. Schapire, Y. Singer, Reducing multiclass to binary: 
A unifying approach for margin classifiers. Journal of Machine Learning 
Research 1 (2001) 113-141. 

[39] L. Xu, J. Neufeld, B. Larson, D. Schuurmans, Maximum margin clustering. 
Advances in Neural Information Processing Systems 17 (2004) 1537-1544. 

[40] L. Xu, D. Schuurmans, Unsupervised and semi-supervised multi-class sup¬ 
port vector machines, in: Proceedings of the National Conference on Arti¬ 
ficial Intelligence, Vol. 20, AAAI, 2005, p. 904. 

[41] H. Valizadegan, R. Jin, Generalized maximum margin clustering and un¬ 
supervised kernel learning, in: Advances in Neural Information Processing 
Systems, Vol. 19, 2006, pp. 1417-1424. 



REFERENCES 


38 


[42] B. Zhao, F. Wang, C. Zhang, Efficient multiclass maximum margin clus¬ 
tering, in: Proceedings of the 25th International Conference on Machine 
learning, ACM, 2008, pp. 1248-1255. 

[43] K. Zhang, 1. Tsang, J. Kwok, Maximum margin clustering made practical, 
IEEE Transactions on Neural Networks 20 (4) (2009) 583-596. 

[44] A. Rahimi, B. Recht, Random features for large-scale kernel machines, in: 
J. Platt, D. Roller, Y. Singer, S. Roweis (Eds.), Advances in Neural In¬ 
formation Processing Systems, Vol. 20, MIT Press, Cambridge, MA, 2008, 
pp. 1177-1184. 

[45] N. Pham, R. Pagh, Fast and scalable polynomial kernels via explicit feature 
maps, in: Proceedings of the 19th ACM SIGKDD International Conference 
on Knowledge Discovery and Data Mining, ACM, New York, 2013, pp. 239- 
247. 

[46] N. Kwak, Nonlinear projection trick in kernel methods: An alternative 
to the kernel trick, IEEE Transactions on Neural Networks and Learning 
Systems 24 (12) (2013) 2113-2119. 

[47] R. Horn, C. Johnson, Matrix analysis, Cambridge university press. New 
York, 1988. 

[48] K. Genova, V. Guliashki, Linear integer programming methods and 
approaches-a survey. Cybernetics And Information Technologies 11 (1). 

[49] MOSEK ApS, Mosek: High performance software for large-scale LP, QP, 
SOCP, SDP and MIP including interfaces to C, Java, MATLAB, .NET, R 
and Python, version 7.0, http://www.mosek.com (2012). 

[50] Gurobi Optimization Inc., Gurobi optimizer: State-of-the-art mathematical 
programming solver, version 5.6, http://www.gurobi.com/ (2014). 

[51] J. Pryor, J. W. Chinneck, Faster integer-feasibility in mixed-integer linear 
programs by branching to force change. Computers & Operations Research 
38 (8) (2011) 1143-1152. 

[52] T. Hastie, R. Tibshirani, J. J. H. Friedman, The elements of statistical 
learning. Springer, New York, 2001. 

[53] M. Grant, S. Boyd, CVX: Matlab software for disciplined convex program¬ 
ming, version 2.0 beta, http://cvxr.com/cvx (2013). 

[54] M. Weber, Improved Perron cluster analysis. Tech. Rep. ZIB-Report 03-04, 
Konrad-Zuse-Zentrum fiir Informationstechnik Berlin (2003). 

[55] A. Asuncion, D. Newman, UCI machine learning repository, http://www. 
ics. uci . edu/$\sim$mlearn/-[MLR}epository. html (2007). 



REFERENCES 


39 


[56] F. Niiske, B. G. Keller, A. S. M. Guillermo Perez-Hernandez, F. Noe, Vari¬ 
ational approach to molecular kinetics, submitted to Journal of Ghemical 
Theory and Gomputation (2013). 

[57] F. Noe, F. Niiske, A variational approach to modeling slow processes in 
stochastic dynamical systems, SIAM Multiscale Model. Simul. 11 (2) (2013) 
635-655. 

[58] S. Shalev-Shwartz, Y. Singer, N. Srebro, A. Gotter, Pegasos: Primal es¬ 
timated sub-gradient solver for svm. Mathematical programming 127 (1) 
(2011) 3-30. 

[59] I. W. Tsang, J. T. Kwok, P.-M. Gheung, Gore vector machines: Fast svm 
training on very large data sets, in: Journal of Machine Learning Research, 
2005, pp. 363-392. 

[60] G. Perez-Hernandez, F. Paul, T. Giogino, G. de Fabritiis, F. Noe, Identifi¬ 
cation of slow molecular order parameters for markov model construction. 
Journal of Ghemical Physics 139 (2013) 015102. 

[61] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for fea¬ 
ture vectors and structured data, GoRR:abs/1306.6709, http://arxiv. 
org/abs/1306.6709 (2013). 

[62] S. Boyd, L. Vandenberghe, Gonvex Optimization, Gambridge University 
Press, New York, 2004. 

[63] P. E. Kloeden, R. Pearson, Numerical Solution of Stochastic Differential 
Equations, Springer, Berlin, 1992. 



